Valores missing, outliers y correlación¶

En este notebook continuamos con el preprocesamiento y análisis exploratorio del dataset procesado previamente. Nos enfocamos en profundizar en las relaciones entre variables y en la preparación de datos para modelos predictivos. Nos enfocaremos en el análisis de distribuciones, correlaciones y segmentaciones, además de explorar posibles valores atípicos y relaciones clave entre las variables y el objetivo (TARGET).

Vamos a seguir los siguientes pasos:

  • Cambio de tipos de variables
  • Separación en train y test
  • Visualización descriptiva de los datos
  • Tratamiento de las variables continuas: correlaciones de pearson, estudio de outliers y estudio de valores missing
  • Tratamiento de las variables categoricas: estudio de correlaciones con vCramer, relleno de valores missing

Librerias¶

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.io as pio
import scipy.stats as stats
from scipy.stats import chi2_contingency
import warnings
import sys

Funciones¶

In [2]:
sys.path.append('../src/')
import functions as f
sys.path.remove('../src/')

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.width', 10000)
In [3]:
#Constantes
seed= 12345

Carga de los datos¶

In [4]:
# Carga de los datos
pd_data = pd.read_csv("../data/df_data_1.csv").set_index('SK_ID_CURR')
pd_data.head()
Out[4]:
COMMONAREA_AVG NONLIVINGAPARTMENTS_AVG FONDKAPREMONT_MODE LIVINGAPARTMENTS_AVG FLOORSMIN_AVG YEARS_BUILD_AVG OWN_CAR_AGE LANDAREA_AVG BASEMENTAREA_AVG EXT_SOURCE_1 NONLIVINGAREA_AVG ELEVATORS_AVG WALLSMATERIAL_MODE APARTMENTS_AVG ENTRANCES_AVG LIVINGAREA_AVG HOUSETYPE_MODE FLOORSMAX_AVG YEARS_BEGINEXPLUATATION_AVG TOTALAREA_MODE EMERGENCYSTATE_MODE OCCUPATION_TYPE EXT_SOURCE_3 ORGANIZATION_TYPE AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR NAME_TYPE_SUITE DEF_60_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_30_CNT_SOCIAL_CIRCLE EXT_SOURCE_2 AMT_GOODS_PRICE AMT_ANNUITY CODE_GENDER CNT_FAM_MEMBERS DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_10 FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 REGION_RATING_CLIENT_W_CITY FLAG_DOCUMENT_11 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_INCOME_TOTAL CNT_CHILDREN FLAG_OWN_REALTY FLAG_OWN_CAR FLAG_DOCUMENT_12 AMT_CREDIT WEEKDAY_APPR_PROCESS_START NAME_INCOME_TYPE HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY REGION_RATING_CLIENT FLAG_EMAIL FLAG_PHONE FLAG_CONT_MOBILE FLAG_WORK_PHONE NAME_CONTRACT_TYPE FLAG_EMP_PHONE FLAG_MOBIL DAYS_ID_PUBLISH DAYS_REGISTRATION DAYS_EMPLOYED DAYS_BIRTH REGION_POPULATION_RELATIVE NAME_HOUSING_TYPE NAME_FAMILY_STATUS NAME_EDUCATION_TYPE TARGET
SK_ID_CURR
100002 0.0143 0.0000 reg oper account 0.0202 0.1250 0.6192 NaN 0.0369 0.0369 0.083037 0.0000 0.00 Stone, brick 0.0247 0.0690 0.0190 block of flats 0.0833 0.9722 0.0149 No Laborers 0.139376 Business Entity Type 3 0.0 0.0 0.0 0.0 0.0 1.0 Unaccompanied 2.0 2.0 2.0 2.0 0.262949 351000.0 24700.5 M 1.0 -1134.0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 202500.0 0 Y N 0 406597.5 3 Working 10 0 0 0 0 0 0 2 0 1 1 0 Cash loans 1 1 -2120 -3648.0 -637 -9461 0.018801 House / apartment Single / not married Secondary / secondary special 1
100003 0.0605 0.0039 reg oper account 0.0773 0.3333 0.7960 NaN 0.0130 0.0529 0.311267 0.0098 0.08 Block 0.0959 0.0345 0.0549 block of flats 0.2917 0.9851 0.0714 No Core staff NaN School 0.0 0.0 0.0 0.0 0.0 0.0 Family 0.0 1.0 0.0 1.0 0.622246 1129500.0 35698.5 F 2.0 -828.0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 270000.0 0 N N 0 1293502.5 1 State servant 11 0 0 0 0 0 0 1 0 1 1 0 Cash loans 1 1 -291 -1186.0 -1188 -16765 0.003541 House / apartment Married Higher education 0
100004 NaN NaN NaN NaN NaN NaN 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Laborers 0.729567 Government 0.0 0.0 0.0 0.0 0.0 0.0 Unaccompanied 0.0 0.0 0.0 0.0 0.555912 135000.0 6750.0 M 1.0 -815.0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 67500.0 0 Y Y 0 135000.0 1 Working 9 0 0 0 0 0 0 2 0 1 1 1 Revolving loans 1 1 -2531 -4260.0 -225 -19046 0.010032 House / apartment Single / not married Secondary / secondary special 0
100006 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Laborers NaN Business Entity Type 3 NaN NaN NaN NaN NaN NaN Unaccompanied 0.0 2.0 0.0 2.0 0.650442 297000.0 29686.5 F 2.0 -617.0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 135000.0 0 Y N 0 312682.5 3 Working 17 0 0 0 0 0 0 2 0 0 1 0 Cash loans 1 1 -2437 -9833.0 -3039 -19005 0.008019 House / apartment Civil marriage Secondary / secondary special 0
100007 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Core staff NaN Religion 0.0 0.0 0.0 0.0 0.0 0.0 Unaccompanied 0.0 0.0 0.0 0.0 0.322738 513000.0 21865.5 M 1.0 -1106.0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 121500.0 0 Y N 0 513000.0 4 Working 11 0 0 0 0 1 1 2 0 0 1 0 Cash loans 1 1 -3458 -4311.0 -3038 -19932 0.028663 House / apartment Single / not married Secondary / secondary special 0
In [5]:
pd_data.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
Index: 307511 entries, 100002 to 456255
Data columns (total 93 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   COMMONAREA_AVG               92646 non-null   float64
 1   NONLIVINGAPARTMENTS_AVG      93997 non-null   float64
 2   FONDKAPREMONT_MODE           97216 non-null   object 
 3   LIVINGAPARTMENTS_AVG         97312 non-null   float64
 4   FLOORSMIN_AVG                98869 non-null   float64
 5   YEARS_BUILD_AVG              103023 non-null  float64
 6   OWN_CAR_AGE                  104582 non-null  float64
 7   LANDAREA_AVG                 124921 non-null  float64
 8   BASEMENTAREA_AVG             127568 non-null  float64
 9   EXT_SOURCE_1                 134133 non-null  float64
 10  NONLIVINGAREA_AVG            137829 non-null  float64
 11  ELEVATORS_AVG                143620 non-null  float64
 12  WALLSMATERIAL_MODE           151170 non-null  object 
 13  APARTMENTS_AVG               151450 non-null  float64
 14  ENTRANCES_AVG                152683 non-null  float64
 15  LIVINGAREA_AVG               153161 non-null  float64
 16  HOUSETYPE_MODE               153214 non-null  object 
 17  FLOORSMAX_AVG                154491 non-null  float64
 18  YEARS_BEGINEXPLUATATION_AVG  157504 non-null  float64
 19  TOTALAREA_MODE               159080 non-null  float64
 20  EMERGENCYSTATE_MODE          161756 non-null  object 
 21  OCCUPATION_TYPE              211120 non-null  object 
 22  EXT_SOURCE_3                 246546 non-null  float64
 23  ORGANIZATION_TYPE            252137 non-null  object 
 24  AMT_REQ_CREDIT_BUREAU_HOUR   265992 non-null  float64
 25  AMT_REQ_CREDIT_BUREAU_DAY    265992 non-null  float64
 26  AMT_REQ_CREDIT_BUREAU_WEEK   265992 non-null  float64
 27  AMT_REQ_CREDIT_BUREAU_MON    265992 non-null  float64
 28  AMT_REQ_CREDIT_BUREAU_QRT    265992 non-null  float64
 29  AMT_REQ_CREDIT_BUREAU_YEAR   265992 non-null  float64
 30  NAME_TYPE_SUITE              306219 non-null  object 
 31  DEF_60_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 32  OBS_60_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 33  DEF_30_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 34  OBS_30_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 35  EXT_SOURCE_2                 306851 non-null  float64
 36  AMT_GOODS_PRICE              307233 non-null  float64
 37  AMT_ANNUITY                  307499 non-null  float64
 38  CODE_GENDER                  307507 non-null  object 
 39  CNT_FAM_MEMBERS              307509 non-null  float64
 40  DAYS_LAST_PHONE_CHANGE       307510 non-null  float64
 41  FLAG_DOCUMENT_10             307511 non-null  int64  
 42  FLAG_DOCUMENT_2              307511 non-null  int64  
 43  FLAG_DOCUMENT_3              307511 non-null  int64  
 44  FLAG_DOCUMENT_4              307511 non-null  int64  
 45  FLAG_DOCUMENT_5              307511 non-null  int64  
 46  FLAG_DOCUMENT_6              307511 non-null  int64  
 47  FLAG_DOCUMENT_7              307511 non-null  int64  
 48  FLAG_DOCUMENT_8              307511 non-null  int64  
 49  FLAG_DOCUMENT_9              307511 non-null  int64  
 50  REGION_RATING_CLIENT_W_CITY  307511 non-null  int64  
 51  FLAG_DOCUMENT_11             307511 non-null  int64  
 52  FLAG_DOCUMENT_13             307511 non-null  int64  
 53  FLAG_DOCUMENT_14             307511 non-null  int64  
 54  FLAG_DOCUMENT_15             307511 non-null  int64  
 55  FLAG_DOCUMENT_16             307511 non-null  int64  
 56  FLAG_DOCUMENT_17             307511 non-null  int64  
 57  FLAG_DOCUMENT_18             307511 non-null  int64  
 58  FLAG_DOCUMENT_19             307511 non-null  int64  
 59  FLAG_DOCUMENT_20             307511 non-null  int64  
 60  FLAG_DOCUMENT_21             307511 non-null  int64  
 61  AMT_INCOME_TOTAL             307511 non-null  float64
 62  CNT_CHILDREN                 307511 non-null  int64  
 63  FLAG_OWN_REALTY              307511 non-null  object 
 64  FLAG_OWN_CAR                 307511 non-null  object 
 65  FLAG_DOCUMENT_12             307511 non-null  int64  
 66  AMT_CREDIT                   307511 non-null  float64
 67  WEEKDAY_APPR_PROCESS_START   307511 non-null  int64  
 68  NAME_INCOME_TYPE             307511 non-null  object 
 69  HOUR_APPR_PROCESS_START      307511 non-null  int64  
 70  REG_REGION_NOT_LIVE_REGION   307511 non-null  int64  
 71  REG_REGION_NOT_WORK_REGION   307511 non-null  int64  
 72  LIVE_REGION_NOT_WORK_REGION  307511 non-null  int64  
 73  REG_CITY_NOT_LIVE_CITY       307511 non-null  int64  
 74  REG_CITY_NOT_WORK_CITY       307511 non-null  int64  
 75  LIVE_CITY_NOT_WORK_CITY      307511 non-null  int64  
 76  REGION_RATING_CLIENT         307511 non-null  int64  
 77  FLAG_EMAIL                   307511 non-null  int64  
 78  FLAG_PHONE                   307511 non-null  int64  
 79  FLAG_CONT_MOBILE             307511 non-null  int64  
 80  FLAG_WORK_PHONE              307511 non-null  int64  
 81  NAME_CONTRACT_TYPE           307511 non-null  object 
 82  FLAG_EMP_PHONE               307511 non-null  int64  
 83  FLAG_MOBIL                   307511 non-null  int64  
 84  DAYS_ID_PUBLISH              307511 non-null  int64  
 85  DAYS_REGISTRATION            307511 non-null  float64
 86  DAYS_EMPLOYED                307511 non-null  int64  
 87  DAYS_BIRTH                   307511 non-null  int64  
 88  REGION_POPULATION_RELATIVE   307511 non-null  float64
 89  NAME_HOUSING_TYPE            307511 non-null  object 
 90  NAME_FAMILY_STATUS           307511 non-null  object 
 91  NAME_EDUCATION_TYPE          307511 non-null  object 
 92  TARGET                       307511 non-null  int64  
dtypes: float64(37), int64(41), object(15)
memory usage: 220.5+ MB

Tras procesar los datos en el notebook anterior, contamos con un dataset depurado que ha sido limpiado de columnas redundantes. El dataset cuenta ahora con 93 columnas y 307,511 filas, como se puede observar al realizar un análisis inicial. Este análisis preliminar confirma que el formato del dataset es adecuado para continuar con la exploración y transformación.

Cambio de tipo a variables categóricas¶

Muchas de las columnas tienen naturaleza categórica, pero están almacenadas como objetos (object). Para optimizar su manejo, transformamos estas columnas a categoricas, lo que mejora el rendimiento del análisis, reduce el uso de memoria y facilita operaciones estadísticas.

In [6]:
pd_data[['CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE']]=pd_data[['CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE']].astype('category')
In [7]:
pd_data[['FONDKAPREMONT_MODE', 'NAME_CONTRACT_TYPE']].astype('category')  
Out[7]:
FONDKAPREMONT_MODE NAME_CONTRACT_TYPE
SK_ID_CURR
100002 reg oper account Cash loans
100003 reg oper account Cash loans
100004 NaN Revolving loans
100006 NaN Cash loans
100007 NaN Cash loans
... ... ...
456251 reg oper account Cash loans
456252 reg oper account Cash loans
456253 reg oper account Cash loans
456254 NaN Cash loans
456255 NaN Cash loans

307511 rows × 2 columns

Por ejemplo, la columna NAME_CONTRACT_TYPE contiene categorías como Cash loans y Revolving loans, que serán útiles para analizar patrones de comportamiento financiero. Mientras tanto, FONDKAPREMONT_MODE incluye datos categóricos relevantes como reg oper account, que también puede ser interpretado en análisis posteriores.

Para comenzar con el análisis de las variables, clasificamos las columnas del dataset según su tipo: booleanas, categóricas y numéricas. Este paso es fundamental para decidir los métodos de análisis y preprocesamiento que aplicaremos a cada tipo de variable.

In [8]:
pd_data_bool, pd_data_cat, pd_data_num = f.tipos_vars(pd_data, False)

En el conjunto de datos, algunas variables clasificadas inicialmente como categóricas contienen valores que en realidad corresponden a datos numéricos. Estas variables necesitan ser trasladadas al grupo de variables numéricas para garantizar su tratamiento adecuado en etapas posteriores.

In [9]:
variables_to_move = [
    'CNT_FAM_MEMBERS', 'CNT_CHILDREN', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
    'DEF_60_CNT_SOCIAL_CIRCLE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_HOUR',
    'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_YEAR',
    'HOUR_APPR_PROCESS_START'
]

Eliminamos las variables identificadas como numéricas del grupo categórico y las añadimos al grupo de variables numéricas. Este proceso asegura que las columnas se utilicen de manera coherente con su naturaleza.

In [10]:
pd_data_cat = [var for var in pd_data_cat if var not in variables_to_move]
pd_data_num.extend([var for var in variables_to_move if var in pd_data.columns])

Tras este ajuste, confirmamos que las variables han sido correctamente reclasificadas. Esto permite evitar errores durante el análisis exploratorio y el modelado posterior.

Para garantizar cálculos precisos y evitar errores de tipo, convertimos todas las variables numéricas al tipo de dato float.

In [11]:
pd_data[pd_data_num] = pd_data[pd_data_num].astype(float)

Este cambio estandariza las variables numéricas, permitiendo la aplicación uniforme de operaciones estadísticas y matemáticas.

Las variables booleanas FLAG_OWN_CAR, FLAG_OWN_REALTY y EMERGENCYSTATE_MODE, contienen valores de texto ("Yes/No", "Y/N"). Para simplificar su análisis y modelado, convertimos estas variables a formato binario (1 para 'YES/Y' y 0 para 'NO/N').

In [12]:
pd_data['FLAG_OWN_CAR'] = pd_data['FLAG_OWN_CAR'].apply(lambda x: 1 if x == 'Y' else 0)
pd_data['FLAG_OWN_REALTY'] = pd_data['FLAG_OWN_REALTY'].apply(lambda x: 1 if x == 'Y' else 0)
pd_data['EMERGENCYSTATE_MODE'] = pd_data['EMERGENCYSTATE_MODE'].apply(lambda x: 1 if x == 'Yes' else 0)

Además, cuando al final del notebook anterior (01), sustituimos los valores XNA por missing en la variable CODE_GENDER, se convirtió también en una variable booleana, por lo que vamos a asignar ceros al valor M (hombre) y unos al valor F (mujer).

In [13]:
pd_data['CODE_GENDER'] = pd_data['CODE_GENDER'].apply(lambda x: 1 if x == 'F' else 0)

Por otro lado, la variable NAME_CONTRACT_TYPE, que habíamos clasificado originalmente como booleana, la vamos a considerar como categórica, ya que sus posibles valores son los strings Cash loans y Revolving Loans.

In [14]:
pd_data_cat.append('NAME_CONTRACT_TYPE')

pd_data_bool.remove('NAME_CONTRACT_TYPE')

Separación en Train y Test estratificado¶

Antes de realizar la separación en conjuntos de Train y Test, es importante conocer la distribución de la variable objetivo (TARGET). Esto nos permitirá garantizar que se mantenga una proporción similar entre ambas clases en los dos subconjuntos.

In [15]:
data_target = pd_data['TARGET']\
    .value_counts(normalize=True)\
    .mul(100).rename('percent').reset_index()

data_target_conteo=pd_data['TARGET'].value_counts()
data_plot_target = pd.merge(data_target, data_target_conteo, on='TARGET')
data_plot_target
Out[15]:
TARGET percent count
0 0 91.927118 282686
1 1 8.072882 24825
In [16]:
fig, ax = plt.subplots()
sns.barplot(data=data_plot_target, x='TARGET', y='percent', ax=ax, hue='TARGET', legend=False)
ax.set_title('Conteo de valores de la variable Target', fontdict={'fontsize':15}) #,'fontname':'Montserrat','weight':'bold'
ax.set_ylabel('Percentage')
ax.set_xlabel('Value')
for i in ax.containers:
    ax.bar_label(i,fmt='{:,.2f}')
plt.show()
No description has been provided for this image

Este gráfico de barras muestra la proporción de cada clase (0 y 1). Observamos que la clase 0 es dominante (91.93%) frente a la clase 1 (8.07%). Este desbalance puede influir en el rendimiento de los modelos y debe tenerse en cuenta a la hora de separar los datos.

Para evaluar correctamente el rendimiento de los modelos, dividimos los datos en dos conjuntos:

Train: Para entrenar el modelo.

Test: Para validar el rendimiento en datos no vistos. En este caso, utilizamos una separación estratificada para mantener la misma proporción de las clases (0 y 1) en ambos conjuntos

In [17]:
from sklearn.model_selection import train_test_split
X_data_train, X_data_test, y_data_train, y_data_test = train_test_split(pd_data.drop('TARGET',axis=1)
                                                                         ,pd_data['TARGET']
                                                                         ,stratify=pd_data['TARGET']
                                                                         ,test_size=0.2
                                                                         ,random_state=seed)
pd_data_train = pd.concat([X_data_train, y_data_train],axis=1)
pd_data_test = pd.concat([X_data_test, y_data_test],axis=1)

train_test_split divide los datos en proporción 80% (entrenamiento) y 20% (prueba). La opción stratify asegura que la proporción de las clases en TARGET sea consistente con la distribución que tiene en el dataset.

In [18]:
print('== Train\n', pd_data_train['TARGET'].value_counts(normalize=True))
print('== Test\n', pd_data_test['TARGET'].value_counts(normalize=True))
== Train
 TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64
== Test
 TARGET
0    0.919272
1    0.080728
Name: proportion, dtype: float64

Visualización descriptiva de los datos¶

Distribución del resto de variables¶

Hacemos gráficas de la distribución de cada variable, y de su distribución condicionada al valor de 'TARGET', para ver si podemos extraer algunas conclusiones provisionales sobre qué variables son más importantes. Para ello hemos creado una función que permite graficar las variables de forma adecuada a su estructura.

In [19]:
warnings.filterwarnings('ignore')
for i in list(pd_data_train.columns):
    if i in pd_data_num:
        f.custom_plot(pd_data_train, col_name=i, is_cont=True, target='TARGET')
    elif  ((i in pd_data_bool) | (i in pd_data_cat)) & (i!='TARGET'):
        f.custom_plot(pd_data_train, col_name=i, is_cont=False, target='TARGET')
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Al analizar las variables visualmente, surgen algunos patrones interesantes. Algunas anotaciones clave basadas en los gráficos incluyen:

  1. Relación entre ingresos y riesgo de incumplimiento:

Se observa que el fraude o riesgo de incumplimiento (TARGET=1) parece correlacionarse con ingresos más bajos o intermedios (NAME_INCOME_TYPE y AMT_INCOME_TOTAL). Sin embargo, las distribuciones también reflejan la predominancia de categorías como "Working" y "Commercial associate", lo que podría sesgar los resultados.

  1. Impacto de la ocupación y nivel educativo:

Las categorías ocupacionales (OCCUPATION_TYPE) como "Low-skill laborers" muestran un porcentaje más alto de incumplimientos. Algo similar ocurre con niveles educativos más bajos (NAME_EDUCATION_TYPE), como "Secondary / secondary special", que presenta una mayor proporción de fraudes en comparación con niveles superiores. Esto refuerza la hipótesis de que las características socioeconómicas juegan un papel relevante en la predicción del riesgo.

  1. Diferencias demográficas importantes:

En las variables como DAYS_BIRTH y DAYS_EMPLOYED, hay indicios de que las personas más jóvenes tienden a tener una mayor proporción de TARGET=1, mientras que la estabilidad laboral prolongada está asociada a menores niveles de incumplimiento.

  1. Regiones y movilidad:

Variables como REG_CITY_NOT_WORK_CITY y LIVE_CITY_NOT_WORK_CITY muestran una mayor proporción de fraudes en clientes que residen o trabajan en diferentes ciudades. Esto puede reflejar riesgos asociados a movilidad geográfica o inestabilidad en el lugar de residencia.

  1. Documentos presentados:

Las variables relacionadas con FLAG_DOCUMENT no muestran una relación clara con el TARGET, aunque podrían indicar problemas en la documentación presentada por ciertos clientes. Esto podría ser más relevante en combinaciones con otras variables.

  1. Regiones con mayor riesgo:

Variables como REGION_POPULATION_RELATIVE muestran que las áreas menos densamente pobladas tienden a tener un mayor riesgo asociado. Sin embargo, las diferencias entre los valores de TARGET no son drásticas.

  1. Riesgo y edad de la vivienda:

Las variables relacionadas con la antigüedad de las propiedades (YEARS_BEGINEXPLUATATION_AVG) reflejan una ligera tendencia hacia un mayor riesgo de incumplimiento en propiedades más antiguas. Esto puede estar relacionado con la capacidad económica de los propietarios para mantener estas propiedades.

  1. Desbalance de la variable objetivo:

La variable objetivo está muy desbalanceada, con una proporción muy baja de TARGET=1. Esto sugiere que será necesario implementar técnicas de re-balanceo, como oversampling o ajustes específicos en el modelado.

Los gráficos revelan tendencias importantes que deben tenerse en cuenta en el preprocesamiento y modelado. Muchas variables parecen tener potencial predictivo, aunque la presencia de outliers, valores imputados y distribuciones altamente sesgadas requiere especial atención. Es importante profundizar en las interacciones entre variables clave y realizar pruebas estadísticas para confirmar estas hipótesis antes de proceder al modelado.

Tratamiento de las variables continuas¶

Tratamiento de outliers¶

Los valores outlier se pueden sustituir por la media, mediana, valores extremos (media+3std o media-3std). Tras el siguiente análisis, hemos decidido como primera iteración dejarlos sin sustituir. Una vez llegue al modelo puedo realizar iteraciones utilizando diferentes métodos para comprobar si mejora el modelo

In [20]:
f.get_deviation_of_mean_perc(pd_data_train, pd_data_num, target='TARGET', multiplier=3)
Out[20]:
0.0 1.0 variable sum_outlier_values porcentaje_sum_null_values
0 0.951237 0.048763 COMMONAREA_AVG 1374 0.005585
1 0.942808 0.057192 NONLIVINGAPARTMENTS_AVG 577 0.002345
2 0.948107 0.051893 LIVINGAPARTMENTS_AVG 1426 0.005797
3 0.957717 0.042283 FLOORSMIN_AVG 473 0.001923
4 0.923158 0.076842 YEARS_BUILD_AVG 950 0.003862
5 0.914980 0.085020 OWN_CAR_AGE 2717 0.011044
6 0.936272 0.063728 LANDAREA_AVG 1679 0.006825
7 0.945352 0.054648 BASEMENTAREA_AVG 1592 0.006471
8 0.947799 0.052201 NONLIVINGAREA_AVG 1954 0.007943
9 0.956456 0.043544 ELEVATORS_AVG 1975 0.008028
10 0.950495 0.049505 APARTMENTS_AVG 2424 0.009853
11 0.940090 0.059910 ENTRANCES_AVG 1786 0.007260
12 0.951437 0.048563 LIVINGAREA_AVG 2574 0.010463
13 0.956226 0.043774 FLOORSMAX_AVG 2056 0.008357
14 0.911817 0.088183 YEARS_BEGINEXPLUATATION_AVG 567 0.002305
15 0.955490 0.044510 TOTALAREA_MODE 2696 0.010959
16 0.961401 0.038599 AMT_GOODS_PRICE 3368 0.013691
17 0.962696 0.037304 AMT_ANNUITY 2359 0.009589
18 0.955426 0.044574 DAYS_LAST_PHONE_CHANGE 516 0.002097
19 0.943662 0.056338 AMT_INCOME_TOTAL 213 0.000866
20 0.959833 0.040167 AMT_CREDIT 2639 0.010727
21 0.964942 0.035058 DAYS_REGISTRATION 599 0.002435
22 0.959401 0.040599 REGION_POPULATION_RELATIVE 6749 0.027434
23 0.899189 0.100811 CNT_FAM_MEMBERS 3204 0.013024
24 0.898623 0.101377 CNT_CHILDREN 3413 0.013874
25 0.881130 0.118870 DEF_30_CNT_SOCIAL_CIRCLE 5485 0.022296
26 0.911427 0.088573 OBS_60_CNT_SOCIAL_CIRCLE 4787 0.019459
27 0.874041 0.125959 DEF_60_CNT_SOCIAL_CIRCLE 3128 0.012715
28 0.911490 0.088510 OBS_30_CNT_SOCIAL_CIRCLE 4926 0.020024
29 0.921172 0.078828 AMT_REQ_CREDIT_BUREAU_WEEK 6825 0.027743
30 0.925170 0.074830 AMT_REQ_CREDIT_BUREAU_HOUR 1323 0.005378
31 0.945904 0.054096 AMT_REQ_CREDIT_BUREAU_MON 2551 0.010370
32 0.910907 0.089093 AMT_REQ_CREDIT_BUREAU_QRT 1852 0.007528
33 0.904318 0.095682 AMT_REQ_CREDIT_BUREAU_DAY 1181 0.004801
34 0.908453 0.091547 AMT_REQ_CREDIT_BUREAU_YEAR 2709 0.011012
35 0.894523 0.105477 HOUR_APPR_PROCESS_START 493 0.002004

Al evaluar las variables continuas, se observa que existe un porcentaje significativo de valores atípicos en varias de ellas, lo que podría influir de manera importante en el comportamiento del modelo. A continuación, se destacan los puntos clave del análisis:

Presencia de Outliers en Variables Clave:

  • Las variables como COMMONAREA_AVG, NONLIVINGAPARTMENTS_AVG y LIVINGAPARTMENTS_AVG presentan valores atípicos en un rango cercano al 0.05% del total de registros. Aunque estos porcentajes pueden parecer bajos, podrían tener un impacto en variables con distribuciones no uniformes.

  • Variables relacionadas con características de vivienda como OWN_CAR_AGE, LANDAREA_AVG, y BASEMENTAREA_AVG también presentan niveles de outliers que superan el 0.01%. Estas variables son críticas, ya que podrían estar correlacionadas con la capacidad de pago del solicitante.

Impacto de Variables Socioeconómicas:

  • Variables como DAYS_EMPLOYED, DAYS_LAST_PHONE_CHANGE, y REGION_POPULATION_RELATIVE presentan outliers más pronunciados. Esto podría indicar que ciertos grupos poblacionales o personas en situaciones específicas tienen comportamientos atípicos en términos de empleo o cambios en contacto.

  • AMT_CREDIT y AMT_INCOME_TOTAL, relacionadas con la cantidad de crédito solicitado y el ingreso anual, contienen valores extremos mínimos pero significativos. Dada su relevancia en modelos de riesgo, su tratamiento cuidadoso es clave.

Análisis de Frecuencias Altas de Outliers:

  • Las variables CNT_FAM_MEMBERS y CNT_CHILDREN, aunque no tienen un porcentaje muy elevado de valores atípicos, reflejan patrones en el tamaño de la familia y la cantidad de hijos, lo que podría indicar inconsistencias en ciertos grupos poblacionales específicos.

Por eso mismo no optamos por modificar estos valores en este modelo, ya que no parece que tengan una influencia muy significativa en nuestra variable objetivo. Pero insistimos de nuevo, en que se podrían realizar iteraciones utilizando diferentes métodos para comprobar si mejora el modelo.

Análisis de relaciones entre las variables¶

Vamos a evaluar las relaciones entre las variables numéricas del conjunto de datos mediante la matriz de correlaciones.

Correlation Matrix para variables numéricas¶

In [21]:
corr = pd.concat([pd_data_train[pd_data_num],pd_data_train['TARGET']], axis=1).corr(method='pearson')
corr
Out[21]:
COMMONAREA_AVG NONLIVINGAPARTMENTS_AVG LIVINGAPARTMENTS_AVG FLOORSMIN_AVG YEARS_BUILD_AVG OWN_CAR_AGE LANDAREA_AVG BASEMENTAREA_AVG EXT_SOURCE_1 NONLIVINGAREA_AVG ELEVATORS_AVG APARTMENTS_AVG ENTRANCES_AVG LIVINGAREA_AVG FLOORSMAX_AVG YEARS_BEGINEXPLUATATION_AVG TOTALAREA_MODE EXT_SOURCE_3 EXT_SOURCE_2 AMT_GOODS_PRICE AMT_ANNUITY DAYS_LAST_PHONE_CHANGE AMT_INCOME_TOTAL AMT_CREDIT DAYS_ID_PUBLISH DAYS_REGISTRATION DAYS_EMPLOYED DAYS_BIRTH REGION_POPULATION_RELATIVE CNT_FAM_MEMBERS CNT_CHILDREN DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE OBS_30_CNT_SOCIAL_CIRCLE AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_YEAR HOUR_APPR_PROCESS_START TARGET
COMMONAREA_AVG 1.000000 0.100469 0.530912 0.294396 0.230494 -0.036155 0.258204 0.403875 0.039327 0.222409 0.521230 0.536240 0.327349 0.544130 0.402248 0.086884 0.551081 -0.001335 0.053072 0.049718 0.055926 -0.006807 0.086892 0.048513 -0.000351 0.024987 -0.010976 0.004699 0.165302 -0.000420 -0.001948 -0.011888 -0.019063 -0.013479 -0.019489 -0.008359 0.007218 0.024211 -0.009351 0.005283 -0.009823 0.044369 -0.019782
NONLIVINGAPARTMENTS_AVG 0.100469 1.000000 0.155995 0.077764 0.071242 -0.025476 0.064150 0.092110 0.016657 0.218461 0.123018 0.193630 0.063206 0.135749 0.113698 0.034691 0.143505 0.006790 0.019464 0.012403 0.019505 0.000306 0.025382 0.011495 -0.003616 0.033233 -0.002648 0.000238 0.023006 0.002642 0.003799 0.002419 -0.002723 0.000265 -0.003077 -0.004075 0.000093 0.001647 0.003089 0.000649 0.000197 0.014044 -0.004158
LIVINGAPARTMENTS_AVG 0.530912 0.155995 1.000000 0.435703 0.331857 -0.045788 0.417132 0.647632 0.050335 0.288581 0.811352 0.943963 0.569113 0.879194 0.587505 0.142244 0.848336 0.004743 0.079635 0.060266 0.074095 -0.005429 0.104597 0.057289 -0.000653 0.021963 -0.021713 0.009600 0.195264 -0.006695 -0.008962 -0.015144 -0.024301 -0.016299 -0.024605 -0.004948 0.002999 0.034519 -0.007409 0.008109 -0.012161 0.076559 -0.024013
FLOORSMIN_AVG 0.294396 0.077764 0.435703 1.000000 0.358867 -0.074004 0.145948 0.219464 0.077853 0.150552 0.508580 0.441624 0.036137 0.461835 0.739743 0.162768 0.450510 0.002468 0.110622 0.077715 0.096028 -0.008299 0.136660 0.075332 -0.007402 0.019105 -0.015649 -0.000523 0.294473 -0.003171 -0.010453 -0.020188 -0.029920 -0.019772 -0.030394 -0.000438 0.005449 0.041122 -0.005084 0.005512 -0.007285 0.112511 -0.034858
YEARS_BUILD_AVG 0.230494 0.071242 0.331857 0.358867 1.000000 -0.048208 0.180048 0.248387 0.020383 0.128014 0.342473 0.340212 0.089555 0.354994 0.519770 0.477032 0.358968 0.016258 0.012544 0.041372 0.033022 0.011331 0.041509 0.035006 -0.007431 0.163101 -0.003792 0.025962 -0.053781 0.041806 0.031144 -0.010540 -0.000594 -0.011541 -0.000444 -0.003822 0.001431 -0.000726 -0.006938 0.002743 -0.024971 -0.013931 -0.020534
OWN_CAR_AGE -0.036155 -0.025476 -0.045788 -0.074004 -0.048208 1.000000 -0.017769 -0.028407 -0.086686 -0.029857 -0.063733 -0.047589 -0.014242 -0.055211 -0.082431 -0.000139 -0.057055 -0.013074 -0.079392 -0.104878 -0.099073 0.004009 -0.123613 -0.095198 0.008095 -0.025635 0.027046 0.007749 -0.080985 -0.013092 0.010799 0.007258 0.003352 0.012106 0.003248 0.002613 -0.005774 -0.021725 -0.018056 -0.008969 -0.015801 -0.069669 0.037821
LANDAREA_AVG 0.258204 0.064150 0.417132 0.145948 0.180048 -0.017769 1.000000 0.462161 0.008756 0.164469 0.370073 0.491916 0.504060 0.495390 0.216667 0.073174 0.485942 0.013631 0.021915 0.013292 0.009182 -0.003278 -0.002688 0.006431 -0.008346 0.001815 -0.009698 0.001475 -0.054471 0.000903 -0.003293 -0.003829 -0.002923 -0.003336 -0.002967 0.011277 -0.003032 0.016973 0.008446 0.007372 -0.013149 0.013557 -0.011140
BASEMENTAREA_AVG 0.403875 0.092110 0.647632 0.219464 0.248387 -0.028407 0.462161 1.000000 0.042072 0.267067 0.564962 0.679892 0.653790 0.693061 0.329318 0.086155 0.673985 0.011103 0.048498 0.042163 0.043311 -0.009265 0.014944 0.037382 -0.015720 -0.020542 -0.001156 -0.005428 0.096177 -0.001777 -0.007473 -0.011970 -0.012988 -0.013713 -0.013220 -0.003813 -0.002100 0.023288 -0.001170 0.003733 -0.013529 0.038674 -0.020987
EXT_SOURCE_1 0.039327 0.016657 0.050335 0.077853 0.020383 -0.086686 0.008756 0.042072 1.000000 0.035329 0.075366 0.056377 0.023855 0.071064 0.095811 0.004505 0.070429 0.190412 0.214980 0.176111 0.121035 -0.130435 0.023450 0.168960 -0.132497 -0.181136 0.290146 -0.601868 0.100125 -0.097436 -0.140413 -0.028141 -0.024224 -0.029206 -0.024711 -0.004562 -0.004717 0.031299 -0.004093 -0.005997 0.005880 0.033482 -0.154130
NONLIVINGAREA_AVG 0.222409 0.218461 0.288581 0.150552 0.128014 -0.029857 0.164469 0.267067 0.035329 1.000000 0.283100 0.298400 0.167951 0.301632 0.253485 0.009326 0.365069 -0.002482 0.048660 0.038534 0.046847 -0.001539 0.071301 0.034250 0.003967 0.048773 -0.010866 0.005618 0.074816 0.000517 -0.000126 -0.011817 -0.017122 -0.011388 -0.017211 -0.005866 0.005633 0.015441 -0.002197 0.004280 -0.008970 0.045419 -0.013034
ELEVATORS_AVG 0.521230 0.123018 0.811352 0.508580 0.342473 -0.063733 0.370073 0.564962 0.075366 0.283100 1.000000 0.836896 0.404818 0.867172 0.679242 0.078577 0.844945 0.008947 0.116073 0.083747 0.101445 -0.011358 0.040543 0.080827 -0.009157 -0.000804 -0.011120 -0.002623 0.280606 -0.002713 -0.007769 -0.021126 -0.030973 -0.021989 -0.031313 0.000221 0.000728 0.046409 -0.004969 0.004896 -0.019256 0.103232 -0.033765
APARTMENTS_AVG 0.536240 0.193630 0.943963 0.441624 0.340212 -0.047589 0.491916 0.679892 0.056377 0.298400 0.836896 1.000000 0.613986 0.913874 0.618257 0.100419 0.893720 0.006739 0.090627 0.063743 0.076203 -0.010119 0.031020 0.059531 -0.007416 0.012643 -0.017215 0.003071 0.206095 -0.011798 -0.013420 -0.012649 -0.022997 -0.015079 -0.023422 0.000008 0.002606 0.037665 -0.002925 0.005730 -0.017085 0.081627 -0.027657
ENTRANCES_AVG 0.327349 0.063206 0.569113 0.036137 0.089555 -0.014242 0.504060 0.653790 0.023855 0.167951 0.404818 0.613986 1.000000 0.620037 0.088954 0.041408 0.597438 0.013878 0.032956 0.019652 0.016963 -0.013513 0.004916 0.015880 -0.016777 -0.063743 0.004242 -0.012474 0.035945 -0.002363 -0.007480 -0.002820 -0.000642 -0.005582 -0.000822 0.000404 -0.005264 0.013637 0.000560 0.006350 -0.010672 0.020854 -0.018575
LIVINGAREA_AVG 0.544130 0.135749 0.879194 0.461835 0.354994 -0.055211 0.495390 0.693061 0.071064 0.301632 0.867172 0.913874 0.620037 1.000000 0.630246 0.091201 0.925154 0.006750 0.097277 0.076165 0.089473 -0.011308 0.035539 0.071231 -0.011285 0.006055 -0.012711 -0.002453 0.212895 -0.005091 -0.010141 -0.015987 -0.024911 -0.017872 -0.025228 -0.001677 0.004844 0.037872 -0.002515 0.011179 -0.019517 0.082304 -0.032207
FLOORSMAX_AVG 0.402248 0.113698 0.587505 0.739743 0.519770 -0.082431 0.216667 0.329318 0.095811 0.253485 0.679242 0.618257 0.088954 0.630246 1.000000 0.126299 0.632131 0.003654 0.135747 0.107791 0.128867 -0.007626 0.054546 0.102533 -0.009191 0.049463 -0.015921 0.000856 0.323741 -0.004221 -0.010567 -0.026768 -0.035389 -0.027018 -0.035684 -0.000060 0.005211 0.049910 -0.000972 0.004846 -0.020153 0.117631 -0.042450
YEARS_BEGINEXPLUATATION_AVG 0.086884 0.034691 0.142244 0.162768 0.477032 -0.000139 0.073174 0.086155 0.004505 0.009326 0.078577 0.100419 0.041408 0.091201 0.126299 1.000000 0.100667 0.001845 0.009301 0.009117 0.015332 0.002849 0.005454 0.008278 -0.004066 0.007685 0.008228 0.000129 -0.006471 0.006488 0.005476 -0.005378 0.000036 -0.005525 0.000165 0.005772 0.001499 -0.002404 0.002633 -0.002304 -0.008381 -0.009800 -0.009794
TOTALAREA_MODE 0.551081 0.143505 0.848336 0.450510 0.358968 -0.057055 0.485942 0.673985 0.070429 0.365069 0.844945 0.893720 0.597438 0.925154 0.632131 0.100667 1.000000 0.007334 0.094617 0.076349 0.089925 -0.009732 0.037218 0.071736 -0.010882 0.017816 -0.015775 -0.000427 0.201603 -0.003306 -0.008657 -0.016029 -0.024667 -0.017922 -0.024990 -0.001454 0.006151 0.037354 -0.003248 0.012101 -0.019690 0.077278 -0.031773
EXT_SOURCE_3 -0.001335 0.006790 0.004743 0.002468 0.016258 -0.013074 0.013631 0.011103 0.190412 -0.002482 0.008947 0.006739 0.013878 0.006750 0.003654 0.001845 0.007334 1.000000 0.110273 0.047787 0.030505 -0.074038 -0.027986 0.043289 -0.130886 -0.107668 0.112850 -0.204777 -0.006984 -0.027076 -0.042484 -0.035200 -0.000293 -0.034077 0.000059 -0.022977 -0.000365 -0.006640 -0.026066 -0.006789 -0.070741 -0.039922 -0.177583
EXT_SOURCE_2 0.053072 0.019464 0.079635 0.110622 0.012544 -0.079392 0.021915 0.048498 0.214980 0.048660 0.116073 0.090627 0.032956 0.097277 0.135747 0.009301 0.094617 0.110273 1.000000 0.140033 0.126214 -0.196768 0.055071 0.131612 -0.051859 -0.060154 -0.019304 -0.093234 0.198612 -0.002425 -0.019035 -0.031039 -0.018631 -0.032968 -0.018944 0.001939 -0.002964 0.051317 -0.002802 0.000718 -0.022717 0.156753 -0.160270
AMT_GOODS_PRICE 0.049718 0.012403 0.060266 0.077715 0.041372 -0.104878 0.013292 0.042163 0.176111 0.038534 0.083747 0.063743 0.019652 0.076165 0.107791 0.009117 0.076349 0.047787 0.140033 1.000000 0.775592 -0.076783 0.147060 0.986965 -0.010654 0.010919 -0.063384 -0.053341 0.103730 0.060779 -0.001518 -0.022161 0.001343 -0.025512 0.001323 -0.002725 -0.003477 0.057638 0.014847 0.004494 -0.051418 0.062376 -0.039155
AMT_ANNUITY 0.055926 0.019505 0.074095 0.096028 0.033022 -0.099073 0.009182 0.043311 0.121035 0.046847 0.101445 0.076203 0.016963 0.089473 0.128867 0.015332 0.089925 0.030505 0.126214 0.775592 1.000000 -0.063913 0.176525 0.770470 0.010050 0.037742 -0.103950 0.009687 0.118301 0.075295 0.021193 -0.022550 -0.010578 -0.024532 -0.010797 0.013609 0.002561 0.038942 0.008759 0.001914 -0.012079 0.052745 -0.012435
DAYS_LAST_PHONE_CHANGE -0.006807 0.000306 -0.005429 -0.008299 0.011331 0.004009 -0.003278 -0.009265 -0.130435 -0.001539 -0.011358 -0.010119 -0.013513 -0.011308 -0.007626 0.002849 -0.009732 -0.074038 -0.196768 -0.076783 -0.063913 1.000000 -0.016266 -0.073897 0.088123 0.057657 0.023036 0.082290 -0.045421 -0.027036 -0.006144 0.000704 -0.013185 0.002622 -0.012751 -0.003836 -0.002439 -0.039937 -0.000606 0.000911 -0.111318 -0.016934 0.055109
AMT_INCOME_TOTAL 0.086892 0.025382 0.104597 0.136660 0.041509 -0.123613 -0.002688 0.014944 0.023450 0.071301 0.040543 0.031020 0.004916 0.035539 0.054546 0.005454 0.037218 -0.027986 0.055071 0.147060 0.176525 -0.016266 1.000000 0.144556 0.007503 0.025130 -0.059076 0.025740 0.068677 0.014148 0.010854 -0.011837 -0.011630 -0.011894 -0.011713 0.002176 0.000592 0.022262 0.004380 0.003116 0.010377 0.033412 -0.002190
AMT_CREDIT 0.048513 0.011495 0.057289 0.075332 0.035006 -0.095198 0.006431 0.037382 0.168960 0.034250 0.080827 0.059531 0.015880 0.071231 0.102533 0.008278 0.071736 0.043289 0.131612 0.986965 0.770470 -0.073897 0.144556 1.000000 -0.008092 0.009058 -0.065374 -0.055325 0.099843 0.062696 0.002377 -0.021113 0.001149 -0.024801 0.001099 -0.002886 -0.004365 0.055544 0.014357 0.004074 -0.048971 0.052911 -0.029698
DAYS_ID_PUBLISH -0.000351 -0.003616 -0.000653 -0.007402 -0.007431 0.008095 -0.008346 -0.015720 -0.132497 0.003967 -0.009157 -0.007416 -0.016777 -0.011285 -0.009191 -0.004066 -0.010882 -0.130886 -0.051859 -0.010654 0.010050 0.088123 0.007503 -0.008092 1.000000 0.101686 -0.272392 0.272716 -0.003389 -0.020354 -0.027310 0.004245 -0.011316 0.005330 -0.010896 -0.001416 0.003361 -0.010847 -0.004276 -0.001228 -0.034659 0.030861 0.051954
DAYS_REGISTRATION 0.024987 0.033233 0.021963 0.019105 0.163101 -0.025635 0.001815 -0.020542 -0.181136 0.048773 -0.000804 0.012643 -0.063743 0.006055 0.049463 0.007685 0.017816 -0.107668 -0.060154 0.010919 0.037742 0.057657 0.025130 0.009058 0.101686 1.000000 -0.209892 0.330983 -0.054004 0.173308 0.183413 0.002167 0.007957 0.004519 0.007935 0.001060 -0.002593 -0.011527 -0.001138 0.001442 -0.023933 -0.010449 0.043475
DAYS_EMPLOYED -0.010976 -0.002648 -0.021713 -0.015649 -0.003792 0.027046 -0.009698 -0.001156 0.290146 -0.010866 -0.011120 -0.017215 0.004242 -0.012711 -0.015921 0.008228 -0.015775 0.112850 -0.019304 -0.063384 -0.103950 0.023036 -0.059076 -0.065374 -0.272392 -0.209892 1.000000 -0.616611 -0.004936 -0.232925 -0.240373 0.016665 0.006509 0.014412 0.006391 0.003387 -0.003885 -0.033226 0.012968 -0.000304 0.050394 -0.092061 -0.043668
DAYS_BIRTH 0.004699 0.000238 0.009600 -0.000523 0.025962 0.007749 0.001475 -0.005428 -0.601868 0.005618 -0.002623 0.003071 -0.012474 -0.002453 0.000856 0.000129 -0.000427 -0.204777 -0.093234 -0.053341 0.009687 0.082290 0.025740 -0.055325 0.272716 0.330983 -0.616611 1.000000 -0.030070 0.278701 0.331193 0.000727 0.005357 0.002809 0.005732 -0.000687 0.003848 0.001379 -0.009455 0.002456 -0.070873 0.091661 0.077406
REGION_POPULATION_RELATIVE 0.165302 0.023006 0.195264 0.294473 -0.053781 -0.080985 -0.054471 0.096177 0.100125 0.074816 0.280606 0.206095 0.035945 0.212895 0.323741 -0.006471 0.201603 -0.006984 0.198612 0.103730 0.118301 -0.045421 0.068677 0.099843 -0.003389 -0.054004 -0.004936 -0.030070 1.000000 -0.024800 -0.026291 0.007131 -0.011246 0.002973 -0.011759 -0.001682 -0.001431 0.078221 -0.001758 0.001420 0.000782 0.170127 -0.035827
CNT_FAM_MEMBERS -0.000420 0.002642 -0.006695 -0.003171 0.041806 -0.013092 0.000903 -0.001777 -0.097436 0.000517 -0.002713 -0.011798 -0.002363 -0.005091 -0.004221 0.006488 -0.003306 -0.027076 -0.002425 0.060779 0.075295 -0.027036 0.014148 0.062696 -0.020354 0.173308 -0.232925 0.278701 -0.024800 1.000000 0.879056 -0.004910 0.023685 -0.006535 0.024053 -0.000613 0.000281 -0.009149 -0.004341 -0.002056 -0.029068 -0.011766 0.009687
CNT_CHILDREN -0.001948 0.003799 -0.008962 -0.010453 0.031144 0.010799 -0.003293 -0.007473 -0.140413 -0.000126 -0.007769 -0.013420 -0.007480 -0.010141 -0.010567 0.005476 -0.008657 -0.042484 -0.019035 -0.001518 0.021193 -0.006144 0.010854 0.002377 -0.027310 0.183413 -0.240373 0.331193 -0.026291 0.879056 1.000000 -0.002826 0.013709 -0.002898 0.014110 -0.001563 -0.000801 -0.011591 -0.007221 -0.000592 -0.041864 -0.006374 0.019716
DEF_30_CNT_SOCIAL_CIRCLE -0.011888 0.002419 -0.015144 -0.020188 -0.010540 0.007258 -0.003829 -0.011970 -0.028141 -0.011817 -0.021126 -0.012649 -0.002820 -0.015987 -0.026768 -0.005378 -0.016029 -0.035200 -0.031039 -0.022161 -0.022550 0.000704 -0.011837 -0.021113 0.004245 0.002167 0.016665 0.000727 0.007131 -0.004910 -0.002826 1.000000 0.336766 0.860355 0.334538 -0.001304 0.001758 -0.000441 -0.001320 -0.000924 0.018832 -0.005510 0.031189
OBS_60_CNT_SOCIAL_CIRCLE -0.019063 -0.002723 -0.024301 -0.029920 -0.000594 0.003352 -0.002923 -0.012988 -0.024224 -0.017122 -0.030973 -0.022997 -0.000642 -0.024911 -0.035389 0.000036 -0.024667 -0.000293 -0.018631 0.001343 -0.010578 -0.013185 -0.011630 0.001149 -0.011316 0.007957 0.006509 0.005357 -0.011246 0.023685 0.013709 0.336766 1.000000 0.259729 0.998516 0.000333 0.000296 0.000356 0.003885 -0.002046 0.034196 -0.008900 0.008350
DEF_60_CNT_SOCIAL_CIRCLE -0.013479 0.000265 -0.016299 -0.019772 -0.011541 0.012106 -0.003336 -0.013713 -0.029206 -0.011388 -0.021989 -0.015079 -0.005582 -0.017872 -0.027018 -0.005525 -0.017922 -0.034077 -0.032968 -0.025512 -0.024532 0.002622 -0.011894 -0.024801 0.005330 0.004519 0.014412 0.002809 0.002973 -0.006535 -0.002898 0.860355 0.259729 1.000000 0.257677 -0.002330 0.000479 -0.002529 0.000147 -0.001830 0.016295 -0.008482 0.030469
OBS_30_CNT_SOCIAL_CIRCLE -0.019489 -0.003077 -0.024605 -0.030394 -0.000444 0.003248 -0.002967 -0.013220 -0.024711 -0.017211 -0.031313 -0.023422 -0.000822 -0.025228 -0.035684 0.000165 -0.024990 0.000059 -0.018944 0.001323 -0.010797 -0.012751 -0.011713 0.001099 -0.010896 0.007935 0.006391 0.005732 -0.011759 0.024053 0.014110 0.334538 0.998516 0.257677 1.000000 0.000239 0.000426 0.000255 0.004027 -0.002077 0.033832 -0.009024 0.008436
AMT_REQ_CREDIT_BUREAU_WEEK -0.008359 -0.004075 -0.004948 -0.000438 -0.003822 0.002613 0.011277 -0.003813 -0.004562 -0.005866 0.000221 0.000008 0.000404 -0.001677 -0.000060 0.005772 -0.001454 -0.022977 0.001939 -0.002725 0.013609 -0.003836 0.002176 -0.002886 -0.001416 0.001060 0.003387 -0.000687 -0.001682 -0.000613 -0.001563 -0.001304 0.000333 -0.002330 0.000239 1.000000 0.003606 -0.013516 -0.014174 0.216424 0.017445 -0.002892 0.001596
AMT_REQ_CREDIT_BUREAU_HOUR 0.007218 0.000093 0.002999 0.005449 0.001431 -0.005774 -0.003032 -0.002100 -0.004717 0.005633 0.000728 0.002606 -0.005264 0.004844 0.005211 0.001499 0.006151 -0.000365 -0.002964 -0.003477 0.002561 -0.002439 0.000592 -0.004365 0.003361 -0.002593 -0.003885 0.003848 -0.001431 0.000281 -0.000801 0.001758 0.000296 0.000479 0.000426 0.003606 1.000000 0.000012 -0.002151 0.237863 -0.004563 -0.016420 -0.000368
AMT_REQ_CREDIT_BUREAU_MON 0.024211 0.001647 0.034519 0.041122 -0.000726 -0.021725 0.016973 0.023288 0.031299 0.015441 0.046409 0.037665 0.013637 0.037872 0.049910 -0.002404 0.037354 -0.006640 0.051317 0.057638 0.038942 -0.039937 0.022262 0.055544 -0.010847 -0.011527 -0.033226 0.001379 0.078221 -0.009149 -0.011591 -0.000441 0.000356 -0.002529 0.000255 -0.013516 0.000012 1.000000 -0.008448 -0.005345 -0.004582 0.037578 -0.012187
AMT_REQ_CREDIT_BUREAU_QRT -0.009351 0.003089 -0.007409 -0.005084 -0.006938 -0.018056 0.008446 -0.001170 -0.004093 -0.002197 -0.004969 -0.002925 0.000560 -0.002515 -0.000972 0.002633 -0.003248 -0.026066 -0.002802 0.014847 0.008759 -0.000606 0.004380 0.014357 -0.004276 -0.001138 0.012968 -0.009455 -0.001758 -0.004341 -0.007221 -0.001320 0.003885 0.000147 0.004027 -0.014174 -0.002151 -0.008448 1.000000 -0.003376 0.071375 0.000488 -0.002705
AMT_REQ_CREDIT_BUREAU_DAY 0.005283 0.000649 0.008109 0.005512 0.002743 -0.008969 0.007372 0.003733 -0.005997 0.004280 0.004896 0.005730 0.006350 0.011179 0.004846 -0.002304 0.012101 -0.006789 0.000718 0.004494 0.001914 0.000911 0.003116 0.004074 -0.001228 0.001442 -0.000304 0.002456 0.001420 -0.002056 -0.000592 -0.000924 -0.002046 -0.001830 -0.002077 0.216424 0.237863 -0.005345 -0.003376 1.000000 -0.002481 0.000177 0.003584
AMT_REQ_CREDIT_BUREAU_YEAR -0.009823 0.000197 -0.012161 -0.007285 -0.024971 -0.015801 -0.013149 -0.013529 0.005880 -0.008970 -0.019256 -0.017085 -0.010672 -0.019517 -0.020153 -0.008381 -0.019690 -0.070741 -0.022717 -0.051418 -0.012079 -0.111318 0.010377 -0.048971 -0.034659 -0.023933 0.050394 -0.070873 0.000782 -0.029068 -0.041864 0.018832 0.034196 0.016295 0.033832 0.017445 -0.004563 -0.004582 0.071375 -0.002481 1.000000 -0.030087 0.017556
HOUR_APPR_PROCESS_START 0.044369 0.014044 0.076559 0.112511 -0.013931 -0.069669 0.013557 0.038674 0.033482 0.045419 0.103232 0.081627 0.020854 0.082304 0.117631 -0.009800 0.077278 -0.039922 0.156753 0.062376 0.052745 -0.016934 0.033412 0.052911 0.030861 -0.010449 -0.092061 0.091661 0.170127 -0.011766 -0.006374 -0.005510 -0.008900 -0.008482 -0.009024 -0.002892 -0.016420 0.037578 0.000488 0.000177 -0.030087 1.000000 -0.022306
TARGET -0.019782 -0.004158 -0.024013 -0.034858 -0.020534 0.037821 -0.011140 -0.020987 -0.154130 -0.013034 -0.033765 -0.027657 -0.018575 -0.032207 -0.042450 -0.009794 -0.031773 -0.177583 -0.160270 -0.039155 -0.012435 0.055109 -0.002190 -0.029698 0.051954 0.043475 -0.043668 0.077406 -0.035827 0.009687 0.019716 0.031189 0.008350 0.030469 0.008436 0.001596 -0.000368 -0.012187 -0.002705 0.003584 0.017556 -0.022306 1.000000

A través de esta tabla podemos observar la correlación entre estas variables. Las variables como LIVINGAPARTMENTS_AVG y LIVINGAREA_AVG muestran correlaciones altas, indicando posible redundancia entre estas variables. Pero para que sea más visual, vamos a ilustrar gráficamente las correlaciones

In [22]:
f.plot_correlation_heatmap(corr)
Out[22]:
(<Figure size 1400x1200 with 2 Axes>,
 <Axes: title={'center': 'Matriz de correlaciones'}>)
No description has been provided for this image

En algunos algoritmos, como los ensembling de árboles (xgboost, randomforest, lightgbm, catboost, ...) no es necesario eliminar colinealidad entre variables. En otros algoritmos como glm si es necesario eliminar la colinealidad, por lo que mostramos aquí una lista de las variables más colineales, en caso de que fuera necesario tenerla en cuenta.

In [23]:
corr = pd_data_train[pd_data_num].corr('pearson')
new_corr = corr.abs()
new_corr.loc[:,:] = np.tril(new_corr, k=-1) # below main lower triangle of an array
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
new_corr[new_corr['correlation']>0.6]
Out[23]:
level_0 level_1 correlation
1460 OBS_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE 0.998516
985 AMT_CREDIT AMT_GOODS_PRICE 0.986965
464 APARTMENTS_AVG LIVINGAPARTMENTS_AVG 0.943963
685 TOTALAREA_MODE LIVINGAREA_AVG 0.925154
557 LIVINGAREA_AVG APARTMENTS_AVG 0.913874
683 TOTALAREA_MODE APARTMENTS_AVG 0.893720
548 LIVINGAREA_AVG LIVINGAPARTMENTS_AVG 0.879194
1289 CNT_CHILDREN CNT_FAM_MEMBERS 0.879056
556 LIVINGAREA_AVG ELEVATORS_AVG 0.867172
1417 DEF_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 0.860355
674 TOTALAREA_MODE LIVINGAPARTMENTS_AVG 0.848336
682 TOTALAREA_MODE ELEVATORS_AVG 0.844945
472 APARTMENTS_AVG ELEVATORS_AVG 0.836896
422 ELEVATORS_AVG LIVINGAPARTMENTS_AVG 0.811352
859 AMT_ANNUITY AMT_GOODS_PRICE 0.775592
986 AMT_CREDIT AMT_ANNUITY 0.770470
591 FLOORSMAX_AVG FLOORSMIN_AVG 0.739743
553 LIVINGAREA_AVG BASEMENTAREA_AVG 0.693061
469 APARTMENTS_AVG BASEMENTAREA_AVG 0.679892
598 FLOORSMAX_AVG ELEVATORS_AVG 0.679242
679 TOTALAREA_MODE BASEMENTAREA_AVG 0.673985
511 ENTRANCES_AVG BASEMENTAREA_AVG 0.653790
296 BASEMENTAREA_AVG LIVINGAPARTMENTS_AVG 0.647632
686 TOTALAREA_MODE FLOORSMAX_AVG 0.632131
601 FLOORSMAX_AVG LIVINGAREA_AVG 0.630246
558 LIVINGAREA_AVG ENTRANCES_AVG 0.620037
599 FLOORSMAX_AVG APARTMENTS_AVG 0.618257
1160 DAYS_BIRTH DAYS_EMPLOYED 0.616611
515 ENTRANCES_AVG APARTMENTS_AVG 0.613986
1142 DAYS_BIRTH EXT_SOURCE_1 0.601868

Convertimos los valores absolutos de la matriz de correlación en una lista de pares de variables, y filtramos las correlaciones mayores a 0.6 para identificar aquellas variables que podrían causar multicolinealidad.

En el análisis de correlación, identificamos varias variables con relaciones positivas muy fuertes, destacando su redundancia potencial. Por ejemplo, OBS_30_CNT_SOCIAL_CIRCLE y OBS_60_CNT_SOCIAL_CIRCLE tienen una correlación de 0.998, lo que indica que ambas miden prácticamente lo mismo sobre las conexiones sociales del cliente. Asimismo, AMT_CREDIT y AMT_GOODS_PRICE (0.987) reflejan que el monto del crédito solicitado está estrechamente vinculado al precio del bien adquirido. Por otro lado, variables relacionadas con propiedades, como APARTMENTS_AVG y LIVINGAPARTMENTS_AVG (0.944), o TOTALAREA_MODE y LIVINGAREA_AVG (0.925), describen características similares sobre las áreas habitables. Por último, CNT_CHILDREN y CNT_FAM_MEMBERS (0.879) resaltan cómo el número de hijos influye directamente en el tamaño de la familia.

Tratamiento de valores nulos¶

Vamos a realizar un análisis descriptivo para identificar qué tan incompletos están los datos, tanto a nivel de columnas como de filas. Esto es crucial para decidir qué acciones tomar en el preprocesamiento, como eliminar filas o columnas con demasiados valores nulos o aplicar estrategias de imputación.

¿Son todos los nulos de una clase de la variable objetivo? o tienen el mismo porcentaje de la variable objetivo?

In [24]:
pd_series_null_columns = pd_data.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = pd_data.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)
(93,) (307511,)

Vamos a generar dos series, una que resume los valores nulos por columna (pd_series_null_columns) y otra por fila (pd_series_null_rows). Las columnas y filas se ordenaron de mayor a menor cantidad de valores faltantes, permitiendo visualizar rápidamente qué variables son más incompletas.

In [25]:
pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])     
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])  
pd_null_filas['target'] = pd_data['TARGET'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/pd_data.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/pd_data.shape[1]
In [26]:
pd_null_columnas[pd_null_columnas['nulos_columnas'] > 0]
Out[26]:
nulos_columnas porcentaje_columnas
COMMONAREA_AVG 214865 0.698723
NONLIVINGAPARTMENTS_AVG 213514 0.694330
FONDKAPREMONT_MODE 210295 0.683862
LIVINGAPARTMENTS_AVG 210199 0.683550
FLOORSMIN_AVG 208642 0.678486
YEARS_BUILD_AVG 204488 0.664978
OWN_CAR_AGE 202929 0.659908
LANDAREA_AVG 182590 0.593767
BASEMENTAREA_AVG 179943 0.585160
EXT_SOURCE_1 173378 0.563811
NONLIVINGAREA_AVG 169682 0.551792
ELEVATORS_AVG 163891 0.532960
WALLSMATERIAL_MODE 156341 0.508408
APARTMENTS_AVG 156061 0.507497
ENTRANCES_AVG 154828 0.503488
LIVINGAREA_AVG 154350 0.501933
HOUSETYPE_MODE 154297 0.501761
FLOORSMAX_AVG 153020 0.497608
YEARS_BEGINEXPLUATATION_AVG 150007 0.487810
TOTALAREA_MODE 148431 0.482685
OCCUPATION_TYPE 96391 0.313455
EXT_SOURCE_3 60965 0.198253
ORGANIZATION_TYPE 55374 0.180072
AMT_REQ_CREDIT_BUREAU_HOUR 41519 0.135016
AMT_REQ_CREDIT_BUREAU_DAY 41519 0.135016
AMT_REQ_CREDIT_BUREAU_WEEK 41519 0.135016
AMT_REQ_CREDIT_BUREAU_MON 41519 0.135016
AMT_REQ_CREDIT_BUREAU_QRT 41519 0.135016
AMT_REQ_CREDIT_BUREAU_YEAR 41519 0.135016
NAME_TYPE_SUITE 1292 0.004201
OBS_60_CNT_SOCIAL_CIRCLE 1021 0.003320
OBS_30_CNT_SOCIAL_CIRCLE 1021 0.003320
DEF_30_CNT_SOCIAL_CIRCLE 1021 0.003320
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.003320
EXT_SOURCE_2 660 0.002146
AMT_GOODS_PRICE 278 0.000904
AMT_ANNUITY 12 0.000039
CODE_GENDER 4 0.000013
CNT_FAM_MEMBERS 2 0.000007
DAYS_LAST_PHONE_CHANGE 1 0.000003

Vamos a ver la distribución de los nulos con respecto a la variable objetivo, para ver si nos está aportando información relevante o los podemos sustituir. En este caso no eliminamos las filas con valores nulos, porque una gran cantidad de filas los contienen.

In [27]:
f.get_percent_null_values_target(pd_data_train, pd_data_num, target='TARGET')
Out[27]:
TARGET 0.0 1.0 variable sum_null_values porcentaje_sum_null_values
TARGET 0.0 1.0
0 0.914195 0.085805 COMMONAREA_AVG 171669 0.697819
1 0.914040 0.085960 NONLIVINGAPARTMENTS_AVG 170615 0.693534
2 0.913672 0.086328 LIVINGAPARTMENTS_AVG 167940 0.682661
3 0.913639 0.086361 FLOORSMIN_AVG 166719 0.677697
4 0.913240 0.086760 YEARS_BUILD_AVG 163417 0.664275
5 0.915260 0.084740 OWN_CAR_AGE 162520 0.660629
6 0.911542 0.088458 LANDAREA_AVG 145810 0.592704
7 0.910606 0.089394 BASEMENTAREA_AVG 143679 0.584042
8 0.914834 0.085166 EXT_SOURCE_1 138647 0.563587
9 0.909419 0.090581 NONLIVINGAREA_AVG 135581 0.551124
10 0.908887 0.091113 ELEVATORS_AVG 130881 0.532019
11 0.908144 0.091856 APARTMENTS_AVG 124597 0.506475
12 0.907972 0.092028 ENTRANCES_AVG 123680 0.502748
13 0.908218 0.091782 LIVINGAREA_AVG 123248 0.500992
14 0.907850 0.092150 FLOORSMAX_AVG 122224 0.496829
15 0.907652 0.092348 YEARS_BEGINEXPLUATATION_AVG 119808 0.487009
16 0.907417 0.092583 TOTALAREA_MODE 118542 0.481862
17 0.906160 0.093840 EXT_SOURCE_3 48913 0.198827
18 0.918095 0.081905 EXT_SOURCE_2 525 0.002134
19 0.922747 0.077253 AMT_GOODS_PRICE 233 0.000947
20 1.000000 NaN AMT_ANNUITY 10 0.000041
21 1.000000 NaN DAYS_LAST_PHONE_CHANGE 1 0.000004
22 1.000000 NaN CNT_FAM_MEMBERS 2 0.000008
23 0.968637 0.031363 DEF_30_CNT_SOCIAL_CIRCLE 829 0.003370
24 0.968637 0.031363 OBS_60_CNT_SOCIAL_CIRCLE 829 0.003370
25 0.968637 0.031363 DEF_60_CNT_SOCIAL_CIRCLE 829 0.003370
26 0.968637 0.031363 OBS_30_CNT_SOCIAL_CIRCLE 829 0.003370
27 0.895681 0.104319 AMT_REQ_CREDIT_BUREAU_WEEK 33340 0.135524
28 0.895681 0.104319 AMT_REQ_CREDIT_BUREAU_HOUR 33340 0.135524
29 0.895681 0.104319 AMT_REQ_CREDIT_BUREAU_MON 33340 0.135524
30 0.895681 0.104319 AMT_REQ_CREDIT_BUREAU_QRT 33340 0.135524
31 0.895681 0.104319 AMT_REQ_CREDIT_BUREAU_DAY 33340 0.135524
32 0.895681 0.104319 AMT_REQ_CREDIT_BUREAU_YEAR 33340 0.135524

Como los nulos en la mayoría de las variables se distribuyen de forma proporcional respecto a la variable objetivo (recordemos que era un 91.93% de ceros y un 8.07% de unos), y después de comprobar que la correlación se mantiene estable, hemos decidicido sustituir los valores missing por la mediana. Las únicas variables que se distribuyen de froma asimétrica son AMT_ANNUITY, DAYS_LAST_PHONE_CHANGE y CNT_FAM_MEMBERS, y como en comparación con el total de los datos son una cantidad insignificante de NA, tampoco afecta a la correlación con la variable target el hecho de sustituir por la mediana.

In [28]:
pd_data_train[pd_data_num] = pd_data_train[pd_data_num].apply(lambda x: x.fillna(x.median()))
pd_data_test[pd_data_num] = pd_data_test[pd_data_num].fillna(pd_data_train[pd_data_num].median())

Volvemos a aplicar la función para verificar que ya no hay valores nulos en las variables numéricas.

In [29]:
f.get_percent_null_values_target(pd_data_train, pd_data_num, target='TARGET')
No existen variables con valores nulos
Out[29]:

Volvemos a visualizar la matriz de correlaciones para ver si se ha modificado tras la imputación de los valores nulos.

In [30]:
corr = pd.concat([pd_data_train[pd_data_num],pd_data_train['TARGET']], axis=1).corr(method='pearson')
f.plot_correlation_heatmap(corr)
Out[30]:
(<Figure size 1400x1200 with 2 Axes>,
 <Axes: title={'center': 'Matriz de correlaciones'}>)
No description has been provided for this image
In [31]:
corr.loc['TARGET'].sort_values(ascending=False)
Out[31]:
TARGET                         1.000000
DAYS_BIRTH                     0.077406
DAYS_LAST_PHONE_CHANGE         0.055109
DAYS_ID_PUBLISH                0.051954
DAYS_REGISTRATION              0.043475
DEF_30_CNT_SOCIAL_CIRCLE       0.031356
DEF_60_CNT_SOCIAL_CIRCLE       0.030611
CNT_CHILDREN                   0.019716
OWN_CAR_AGE                    0.016371
AMT_REQ_CREDIT_BUREAU_YEAR     0.009758
CNT_FAM_MEMBERS                0.009688
OBS_30_CNT_SOCIAL_CIRCLE       0.008783
OBS_60_CNT_SOCIAL_CIRCLE       0.008697
AMT_REQ_CREDIT_BUREAU_DAY      0.002466
AMT_REQ_CREDIT_BUREAU_WEEK    -0.000659
AMT_REQ_CREDIT_BUREAU_HOUR    -0.001306
AMT_INCOME_TOTAL              -0.002190
YEARS_BEGINEXPLUATATION_AVG   -0.004636
AMT_REQ_CREDIT_BUREAU_QRT     -0.006440
NONLIVINGAPARTMENTS_AVG       -0.006499
YEARS_BUILD_AVG               -0.010426
LANDAREA_AVG                  -0.012367
AMT_ANNUITY                   -0.012433
AMT_REQ_CREDIT_BUREAU_MON     -0.014678
ENTRANCES_AVG                 -0.015639
BASEMENTAREA_AVG              -0.016691
COMMONAREA_AVG                -0.016894
NONLIVINGAREA_AVG             -0.018123
LIVINGAPARTMENTS_AVG          -0.018872
FLOORSMIN_AVG                 -0.021834
HOUR_APPR_PROCESS_START       -0.022306
APARTMENTS_AVG                -0.025724
LIVINGAREA_AVG                -0.029194
AMT_CREDIT                    -0.029698
TOTALAREA_MODE                -0.029837
ELEVATORS_AVG                 -0.035738
REGION_POPULATION_RELATIVE    -0.035827
FLOORSMAX_AVG                 -0.038597
AMT_GOODS_PRICE               -0.039133
DAYS_EMPLOYED                 -0.043668
EXT_SOURCE_1                  -0.098159
EXT_SOURCE_3                  -0.154409
EXT_SOURCE_2                  -0.160081
Name: TARGET, dtype: float64

Las correlaciones encontradas son en su mayoría débiles, lo que sugiere que ninguna variable tiene un impacto muy fuerte sobre la variable objetivo por sí sola. Sin embargo, combinaciones de estas variables podrían proporcionar información más útil en modelos predictivos.

Tratamiento de las variables categoricas¶

Vamos a convertir las variables categoricas en numericas para después obtener la correlación gracias a la correlación de spearman y el coeficiente V-Cramer.

Se han procesado las variables booleanas y categóricas para calcular su correlación con la variable objetivo TARGET.

In [32]:
pd_data_train[pd_data_bool].head()
Out[32]:
EMERGENCYSTATE_MODE CODE_GENDER FLAG_DOCUMENT_10 FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_11 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 FLAG_OWN_REALTY FLAG_OWN_CAR FLAG_DOCUMENT_12 REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY FLAG_EMAIL FLAG_PHONE FLAG_CONT_MOBILE FLAG_WORK_PHONE FLAG_EMP_PHONE FLAG_MOBIL TARGET
SK_ID_CURR
172963 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0
423589 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0
392520 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0
359127 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0
431456 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
In [33]:
corr_bool = pd_data_train[pd_data_bool].corr(method='pearson')
f.plot_correlation_heatmap(corr_bool)
Out[33]:
(<Figure size 1400x1200 with 2 Axes>,
 <Axes: title={'center': 'Matriz de correlaciones'}>)
No description has been provided for this image

La matriz de correlación muestra correlaciones altas entre variables como REG_CITY_NOT_WORK_CITY y LIVE_CITY_NOT_WORK_CITY (0.9), lo que sugiere redundancia, ya que ambas representan relaciones similares entre residencia y trabajo. Sin embargo, las correlaciones con la variable objetivo (TARGET) son bajas.

Otro dato interesante que podemos extraer de la gráfica, aunque no necesariamente relevante respecto a la variable target, es que FLAG_DOCUMENT_6 presenta una correlación negativa significativa (-0.6) con FLAG_EMP_PHONE, por lo que probablemente este documento contenga datos personales del cliente, entre ellos su número de teléfono.

Tambien hemos procesado las variables categóricas (no booleanas), transformadas en formato numérico para calcular sus correlaciones, utilizando una función que itera la función cramers_v de los apuntes.

In [34]:
corr_cats = f.corr_cat(df=pd_data_train,target='TARGET',target_transform=True)
corr_cats
Out[34]:
WALLSMATERIAL_MODE HOUSETYPE_MODE OCCUPATION_TYPE ORGANIZATION_TYPE NAME_TYPE_SUITE CODE_GENDER REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START NAME_INCOME_TYPE REGION_RATING_CLIENT NAME_HOUSING_TYPE NAME_FAMILY_STATUS NAME_EDUCATION_TYPE TARGET
WALLSMATERIAL_MODE 1.000000 0.106949 0.026653 0.030347 0.008646 0.015555 0.098011 0.003890 0.024058 0.105330 0.035866 0.008624 0.036321 0.027575
HOUSETYPE_MODE 0.106949 1.000000 0.022958 0.021435 0.000000 0.007973 0.019517 0.000000 0.016142 0.025076 0.016962 0.010240 0.016470 0.011329
OCCUPATION_TYPE 0.026653 0.022958 1.000000 0.301776 0.020917 0.572457 0.057319 0.019195 0.144951 0.057187 0.032623 0.058650 0.219109 0.080576
ORGANIZATION_TYPE 0.030347 0.021435 0.301776 1.000000 0.018956 0.323213 0.093888 0.019775 0.203516 0.094917 0.067417 0.049357 0.116218 0.056873
NAME_TYPE_SUITE 0.008646 0.000000 0.020917 0.018956 1.000000 0.063592 0.024091 0.016191 0.021888 0.024646 0.018756 0.067369 0.022795 0.009530
CODE_GENDER 0.015555 0.007973 0.572457 0.323213 0.063592 0.999991 0.020063 0.006377 0.170464 0.019900 0.066897 0.167912 0.026127 0.055378
REGION_RATING_CLIENT_W_CITY 0.098011 0.019517 0.057319 0.093888 0.024091 0.020063 1.000000 0.017709 0.130031 0.956621 0.086834 0.023566 0.071797 0.060857
WEEKDAY_APPR_PROCESS_START 0.003890 0.000000 0.019195 0.019775 0.016191 0.006377 0.017709 1.000000 0.013009 0.018754 0.004755 0.003939 0.005417 0.004307
NAME_INCOME_TYPE 0.024058 0.016142 0.144951 0.203516 0.021888 0.170464 0.130031 0.013009 1.000000 0.136369 0.054748 0.113153 0.104543 0.062831
REGION_RATING_CLIENT 0.105330 0.025076 0.057187 0.094917 0.024646 0.019900 0.956621 0.018754 0.136369 1.000000 0.087524 0.022782 0.071932 0.059122
NAME_HOUSING_TYPE 0.035866 0.016962 0.032623 0.067417 0.018756 0.066897 0.086834 0.004755 0.054748 0.087524 1.000000 0.067920 0.041843 0.036127
NAME_FAMILY_STATUS 0.008624 0.010240 0.058650 0.049357 0.067369 0.167912 0.023566 0.003939 0.113153 0.022782 0.067920 1.000000 0.052976 0.039689
NAME_EDUCATION_TYPE 0.036321 0.016470 0.219109 0.116218 0.022795 0.026127 0.071797 0.005417 0.104543 0.071932 0.041843 0.052976 1.000000 0.056687
TARGET 0.027575 0.011329 0.080576 0.056873 0.009530 0.055378 0.060857 0.004307 0.062831 0.059122 0.036127 0.039689 0.056687 0.999973
In [35]:
plt.figure(figsize=(8,4))
sns.heatmap(corr_cats, annot=True, fmt='.3f', cmap='YlOrRd', annot_kws={"size": 8})
plt.title('Cramers V Matrix', fontdict={'size':'17'})
plt.show()
No description has been provided for this image

Esta matriz de Cramér's V revela información interesante sobre las relaciones entre las variables categóricas y la variable objetivo TARGET en el dataset.

Se destacan correlaciones moderadas entre algunas variables, como CODE_GENDER y ORGANIZATION_TYPE (0.572) o REGION_RATING_CLIENT y REGION_RATING_CLIENT_W_CITY (0.957). Estas correlaciones sugieren que algunas variables comparten información o describen aspectos relacionados de los clientes, como su género o calificaciones asociadas a la región de residencia. Esto indica relaciones lógicas predecibles entre las variables.

Las correlaciones con la variable TARGET son en general bajas (menores a 0.1), lo que implica que estas variables categóricas, aunque relevantes, no tienen un impacto directo o individual significativo en el modelo.

Tratamiento de valores nulos¶

En las variables categóricas, los valores nulos se suelen sustituir por una nueva clase: "sin valor" o por la moda

In [36]:
pd_data_train[pd_data_bool].isna().sum()
Out[36]:
EMERGENCYSTATE_MODE            0
CODE_GENDER                    4
FLAG_DOCUMENT_10               0
FLAG_DOCUMENT_2                0
FLAG_DOCUMENT_3                0
FLAG_DOCUMENT_4                0
FLAG_DOCUMENT_5                0
FLAG_DOCUMENT_6                0
FLAG_DOCUMENT_7                0
FLAG_DOCUMENT_8                0
FLAG_DOCUMENT_9                0
FLAG_DOCUMENT_11               0
FLAG_DOCUMENT_13               0
FLAG_DOCUMENT_14               0
FLAG_DOCUMENT_15               0
FLAG_DOCUMENT_16               0
FLAG_DOCUMENT_17               0
FLAG_DOCUMENT_18               0
FLAG_DOCUMENT_19               0
FLAG_DOCUMENT_20               0
FLAG_DOCUMENT_21               0
FLAG_OWN_REALTY                0
FLAG_OWN_CAR                   0
FLAG_DOCUMENT_12               0
REG_REGION_NOT_LIVE_REGION     0
REG_REGION_NOT_WORK_REGION     0
LIVE_REGION_NOT_WORK_REGION    0
REG_CITY_NOT_LIVE_CITY         0
REG_CITY_NOT_WORK_CITY         0
LIVE_CITY_NOT_WORK_CITY        0
FLAG_EMAIL                     0
FLAG_PHONE                     0
FLAG_CONT_MOBILE               0
FLAG_WORK_PHONE                0
FLAG_EMP_PHONE                 0
FLAG_MOBIL                     0
TARGET                         0
dtype: int64
In [37]:
missing_gender = pd_data_train[pd_data_train['CODE_GENDER'].isnull()]
target_dist = missing_gender['TARGET']
print("Target distribution for missing code_gender:")
print(target_dist)
Target distribution for missing code_gender:
SK_ID_CURR
144669    0
319880    0
141289    0
196708    0
Name: TARGET, dtype: int64
In [38]:
pd_data_train['CODE_GENDER'] = pd_data_train['CODE_GENDER'].astype('category').cat.add_categories('-1').fillna('-1')

Podemos observar que las variables booleanas, en general, no presentan valores nulos, con la excepción de la variable CODE_GENDER, que tiene únicamente 4 valores faltantes. Dado que este número es pequeño, se ha decidido eliminar las cuatro filas que contienen esos valores, ya que representan un porcetaje insignificante del total de filas, y ninguno tiene valor 1 en la variable objetivo. Esto nos permite seguir considerando la variable como boolena, y ahorrarnos una nueva categoría que sólo se presentaría 4 veces.

En las variables categóricas generales, se observan varias columnas con cantidades considerables de valores nulos:

In [39]:
pd_data_train[pd_data_cat].isna().sum()
Out[39]:
FONDKAPREMONT_MODE             168050
WALLSMATERIAL_MODE             124875
HOUSETYPE_MODE                 123196
OCCUPATION_TYPE                 77126
ORGANIZATION_TYPE               44460
NAME_TYPE_SUITE                  1050
REGION_RATING_CLIENT_W_CITY         0
WEEKDAY_APPR_PROCESS_START          0
NAME_INCOME_TYPE                    0
REGION_RATING_CLIENT                0
NAME_HOUSING_TYPE                   0
NAME_FAMILY_STATUS                  0
NAME_EDUCATION_TYPE                 0
NAME_CONTRACT_TYPE                  0
dtype: int64
In [40]:
for col in pd_data_cat:
        pd_data_train[col] = pd_data_train[col].astype('category').cat.add_categories('Sin valor').fillna('Sin valor')

En las variables categóricas generales se encuentran variables con cantidades significativas de valores nulos, como: FONDKAPREMONT_MODE (168,050 valores nulos), WALLSMATERIAL_MODE (124,875 valores nulos), HOUSETYPE_MODE (123,196 valores nulos). Otras variables como NAME_TYPE_SUITE presentan menos valores nulos (1,050). Optamos por cambiarlos a una nueva categoría Sin valor, ya que al ser tantos, sustituirlos por la moda podría afectar al funcinamiento del modelo.

Exportación parcial del DataSet dividido en Train y Test¶

Para acabar el notebook 02, procedemos a exportar a dos archivos CSV (test_pd_data_preprocessing_missing_outliers.csv y train_pd_data_preprocessing_missing_outliers.csv) el contenido del dataset ya dividido en Train y Test.

In [41]:
pd_data_train.to_csv("../data/train_pd_data_preprocessing_missing_outlier.csv")
pd_data_test.to_csv("../data/test_pd_data_preprocessing_missing_outlier.csv")